The GIL Explained

The Benchmark That Breaks Intuition

You have a CPU-bound task. You add threads to go faster. Your program gets slower. This is the GIL.

Here is the benchmark. Run it and see the numbers yourself:

import threading
import time

def count_up(n):
    """CPU-bound: pure Python arithmetic loop."""
    total = 0
    for i in range(n):
        total += i
    return total

N = 50_000_000

# Single-threaded: one thread does all the work
start = time.perf_counter()
count_up(N)
single_elapsed = time.perf_counter() - start
print(f"Single thread:  {single_elapsed:.3f}s")

# Two threads: split the work in half
start = time.perf_counter()
t1 = threading.Thread(target=count_up, args=(N // 2,))
t2 = threading.Thread(target=count_up, args=(N // 2,))
t1.start()
t2.start()
t1.join()
t2.join()
two_thread_elapsed = time.perf_counter() - start
print(f"Two threads:    {two_thread_elapsed:.3f}s")

print(f"Speedup:        {single_elapsed / two_thread_elapsed:.2f}x")

Typical output on CPython 3.11 (8-core machine):

Single thread:  2.847s
Two threads:    3.124s
Speedup:        0.91x

Two threads are slower than one. On a machine with 8 available cores, adding a second thread makes the program take about 10% longer (a 0.91x "speedup") instead of running 2x faster.

This result is not a bug. It is the GIL working exactly as designed.

What the GIL Is

The GIL - Global Interpreter Lock - is a mutex (mutual exclusion lock) that must be held by a Python thread before it can execute any Python bytecode. Only one thread can hold the GIL at a time. Therefore, only one thread executes Python bytecode at any instant, regardless of how many CPU cores are available.

In CPython's source, the GIL lives in the runtime state structure:

// Include/internal/pycore_gil.h and related headers (CPython 3.11, simplified)
struct _gil_runtime_state {
    unsigned long interval;          // Switch interval in microseconds (default 5000)
    _Py_atomic_address last_holder;  // PyThreadState that last held the GIL
    _Py_atomic_int locked;           // 1 if the GIL is held, 0 if free, -1 if uninitialised
    unsigned long switch_number;     // Incremented each time the GIL changes hands
    PyCOND_T cond;                   // Signalled when the GIL is released
    PyMUTEX_T mutex;                 // Protects cond and the fields above
    PyCOND_T switch_cond;            // Signalled once a forced switch has happened
    PyMUTEX_T switch_mutex;          // Protects switch_cond
};

// The flags polled by the eval loop live in the per-interpreter ceval state
struct _ceval_state {
    _Py_atomic_int eval_breaker;     // Non-zero: leave the fast path and check for work
    _Py_atomic_int gil_drop_request; // Another thread wants the GIL
};

The eval loop checks eval_breaker at well-defined points (function entry and backward jumps in loops) rather than on every opcode. If the flag is set because another thread has requested the GIL, the current thread drops the GIL, lets a waiting thread acquire it, and re-acquires it before continuing.

Why the GIL Was Necessary

CPython manages memory using reference counting. Every PyObject has an ob_refcnt field. When you create a reference, ob_refcnt is incremented; when you destroy a reference, it is decremented; when it hits zero, the object is freed.
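The counting described above can be observed from Python itself. The sketch below uses sys.getrefcount, which reports one more than you might expect because the call itself temporarily holds a reference; the variable names are illustrative only.

```python
import sys

obj = []                              # a fresh list, referenced by the name 'obj'
baseline = sys.getrefcount(obj)       # includes the temporary reference made by the call

alias = obj                           # creating a second reference increments ob_refcnt
with_alias = sys.getrefcount(obj)

del alias                             # destroying it decrements ob_refcnt again
after_del = sys.getrefcount(obj)

print(baseline, with_alias, after_del)
```

The absolute numbers depend on where the code runs, so only the differences matter: one more with the alias, back to the baseline after del.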

On a multi-core machine with multiple threads, ob_refcnt is a shared mutable integer. Without synchronisation, incrementing and decrementing it is a data race:

Thread 1 executing: Py_DECREF(obj)    Thread 2 executing: Py_INCREF(obj)
  read ob_refcnt → 1                    read ob_refcnt → 1
  compute 1 - 1 = 0                     compute 1 + 1 = 2
  write ob_refcnt = 0                   write ob_refcnt = 2
  call tp_dealloc(obj) → FREE MEMORY

  ... Thread 2 now holds a pointer to freed memory ...
  → use-after-free, segfault, or silent memory corruption

Without the GIL, every Py_INCREF and Py_DECREF would need to be an atomic operation (using __atomic_add_fetch or equivalent). In 2023, CPython has over 25,000 calls to Py_INCREF/Py_DECREF in the interpreter and standard library. Making them all atomic is not free:

Atomic operations require CPU memory barriers
Memory barriers flush the CPU's store buffer and invalidate cache lines
On a highly contended multi-core system, this can cost 5-10x more than a non-atomic increment

When Guido van Rossum added threading support to Python in 1992, a single GIL was the pragmatic choice. It made the interpreter thread-safe with minimal per-object overhead and minimal complexity in the C extension API. It worked well because most Python programs were I/O-bound anyway, and for I/O-bound work (covered below) the GIL is released during syscalls.

How the GIL Works: The Eval Breaker

The GIL does not require threads to check a global variable on every opcode - that would be too expensive. Instead, CPython uses an "eval breaker" mechanism:

Thread 1 (holds GIL):                Thread 2 (waiting for GIL):
  executing opcodes ...                 timed wait on condition variable (5 ms)
  RESUME, LOAD_FAST, BINARY_OP, ...

                                        (5 ms pass, GIL still held)
                                        sets gil_drop_request and eval_breaker

  Thread 1 sees eval_breaker:
    drops GIL (mutex unlock)
    signals Thread 2
    waits to re-acquire GIL

                                      Thread 2 acquires GIL:
                                        executes opcodes ...
                                        (for about another 5 ms)

The switch interval is configurable:

import sys

print(sys.getswitchinterval())    # 0.005 (5 milliseconds, default since 3.2)

# You can change it (rarely useful in practice)
sys.setswitchinterval(0.001)      # Switch every 1ms - more responsive, more overhead
sys.setswitchinterval(0.1)        # Switch every 100ms - less overhead, less fair

# Reset to default
sys.setswitchinterval(0.005)

The check itself happens at the RESUME opcode at function entry and on backward jumps inside loops, not on every instruction. The overhead of the check is small: it is a single atomic read of eval_breaker.
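Because the switch interval is a process-wide setting, it is safer to change it only for a limited scope and restore it afterwards. A minimal sketch, assuming a helper named switch_interval (a hypothetical name, not part of the standard library):

```python
import sys
from contextlib import contextmanager

@contextmanager
def switch_interval(seconds):
    """Temporarily change the process-wide switch interval (hypothetical helper)."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        yield
    finally:
        sys.setswitchinterval(previous)   # always restore, even if the body raises

before = sys.getswitchinterval()
with switch_interval(0.001):
    inside = sys.getswitchinterval()      # 1 ms while the block is active
after = sys.getswitchinterval()

print(before, inside, after)
```

The setting affects every thread in the process, so this only makes sense around a short, well-understood section.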

GIL Release Points

The GIL is released automatically during:

1. I/O system calls

When a thread calls read(), write(), recv(), send(), or any blocking I/O operation, CPython calls Py_BEGIN_ALLOW_THREADS before the syscall and Py_END_ALLOW_THREADS after:

// Simplified from Modules/socketmodule.c (error handling omitted)
static PyObject *
sock_recv(PySocketSockObject *s, Py_ssize_t buflen, int flags)
{
    Py_ssize_t n;
    char *buf = PyMem_Malloc(buflen);

    Py_BEGIN_ALLOW_THREADS           // ← RELEASES the GIL
    n = recv(s->sock_fd, buf, buflen, flags);
    Py_END_ALLOW_THREADS             // ← RE-ACQUIRES the GIL

    // Back to Python - GIL is held again
    PyObject *result = PyBytes_FromStringAndSize(buf, n);
    PyMem_Free(buf);
    return result;
}

During the recv() syscall, the thread does not hold the GIL, so other Python threads can run freely. This is why threading works well for I/O-bound workloads:

import threading
import urllib.request
import time

urls = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
]

def fetch(url):
    urllib.request.urlopen(url).read()

# Sequential: 4 seconds (each fetch takes 1s)
start = time.perf_counter()
for url in urls:
    fetch(url)
print(f"Sequential: {time.perf_counter() - start:.1f}s")   # ~4.0s

# Concurrent threads: ~1 second (GIL released during I/O)
start = time.perf_counter()
threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
for t in threads: t.start()
for t in threads: t.join()
print(f"Threaded: {time.perf_counter() - start:.1f}s")     # ~1.0s

2. C extensions calling Py_BEGIN_ALLOW_THREADS

Any well-written C extension that does non-Python work (compression, cryptography, matrix operations) can release the GIL. This is why many of NumPy's core operations can run in parallel across threads:

import numpy as np
import threading
import time

# NumPy releases the GIL during computation
A = np.random.randn(2000, 2000)
B = np.random.randn(2000, 2000)

def matmul():
    return np.dot(A, B)   # Calls BLAS routine - GIL released

# Two threads, each computing one matrix product concurrently
start = time.perf_counter()
t1 = threading.Thread(target=matmul)
t2 = threading.Thread(target=matmul)
t1.start(); t2.start()
t1.join(); t2.join()
two_thread = time.perf_counter() - start

# Baseline: the same two products, one after the other
start = time.perf_counter()
matmul()
matmul()
sequential = time.perf_counter() - start

print(f"Sequential:  {sequential:.3f}s")
print(f"Two threads: {two_thread:.3f}s")
# Two threads can be ~1.5-1.8x faster when BLAS itself runs single-threaded.
# If your BLAS already uses every core for one product, expect little or no gain.
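The same effect can be seen with the standard library alone. hashlib is documented to release the GIL while hashing inputs larger than 2047 bytes, so large buffers can be hashed from several threads at once. A minimal sketch; the buffer sizes and the digest helper are arbitrary choices for illustration:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def digest(block: bytes) -> str:
    # hashlib drops the GIL while it hashes a large buffer in C
    return hashlib.sha256(block).hexdigest()

# Four 4 MB buffers with different contents (sizes are arbitrary)
blocks = [bytes([i]) * 4_000_000 for i in range(4)]

sequential = [digest(b) for b in blocks]

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, blocks))   # map() preserves input order

print(threaded == sequential)
```

The results are identical either way; wrapping each variant in time.perf_counter() calls shows how much the threaded version gains on a given machine.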

Pre-3.2 GIL vs the "New GIL"

The original GIL (before Python 3.2) was interval-based on opcode count: every 100 opcodes (the sys.setcheckinterval() default), the current thread dropped the GIL to give others a chance.

David Beazley's 2010 PyCon talk ("Understanding the Python GIL") demonstrated a severe problem with the old scheme: on multi-core machines, two CPU-bound threads competing for the GIL caused massive contention. The OS kept scheduling both threads simultaneously on different cores, causing thousands of failed acquisition attempts per second and enormous signal/mutex overhead. In his dual-core measurements, the two-thread version took nearly twice as long as the sequential one.

The "new GIL" (Antoine Pitrou, Python 3.2, 2011) replaced opcode counting with a time-based approach using condition variables. Key changes:

The GIL is now time-based: sys.getswitchinterval() defaults to 5ms (0.005 seconds)
A waiting thread waits on a condition variable with a 5ms timeout
If the GIL has not changed hands when the timeout expires, the waiting thread sets gil_drop_request
The running thread responds at the next eval breaker check by dropping the GIL and signalling the condition variable
The previously waiting thread acquires the GIL, and the thread that dropped it waits until that switch has happened before trying to re-acquire

This eliminates the worst-case contention of the old scheme at the cost of some latency (up to 5ms) for a thread waiting to acquire the GIL.
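The hand-off steps above can be sketched with a toy lock built on threading.Condition. This is an illustration under simplifying assumptions, not CPython's implementation: ToyGIL, hand_over, and the worker loop are invented names, and a short sleep stands in for executing bytecode.

```python
import threading
import time

class ToyGIL:
    """Toy model of the time-based hand-off. Not CPython's real code."""

    def __init__(self, interval=0.001):
        self.interval = interval            # stand-in for the switch interval
        self._cond = threading.Condition()
        self._locked = False
        self.switch_number = 0              # counts successful acquisitions
        self.drop_request = False           # stand-in for gil_drop_request

    def acquire(self):
        with self._cond:
            while self._locked:
                # Timed wait; if it expires, ask the holder to let go.
                if not self._cond.wait(timeout=self.interval):
                    self.drop_request = True
            self._locked = True
            self.drop_request = False
            self.switch_number += 1
            self._cond.notify_all()         # wake a thread parked in hand_over()

    def release(self):
        with self._cond:
            self._locked = False
            self._cond.notify_all()

    def hand_over(self):
        """Drop the lock and wait until another thread has really taken it."""
        with self._cond:
            seen = self.switch_number
            self._locked = False
            self._cond.notify_all()
            while self.switch_number == seen:   # forced switch: no instant re-acquire
                self._cond.wait()
        self.acquire()

gil = ToyGIL()
state = {"inside": 0, "max_inside": 0}      # how many threads "run bytecode" at once
steps_done = {}

def worker(name, steps):
    gil.acquire()
    for _ in range(steps):
        state["inside"] += 1
        state["max_inside"] = max(state["max_inside"], state["inside"])
        time.sleep(0.0002)                  # pretend to execute some bytecode
        state["inside"] -= 1
        steps_done[name] = steps_done.get(name, 0) + 1
        if gil.drop_request:                # the "eval breaker" check
            gil.hand_over()
    gil.release()

threads = [threading.Thread(target=worker, args=(n, 30)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(steps_done, "max threads inside:", state["max_inside"])
```

Both workers finish all their steps, and at no point are two of them inside the protected region together. The wait inside hand_over is the piece that keeps the releasing thread from grabbing the lock straight back.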

Benchmark: Threads vs Processes for CPU vs I/O Work

import threading
import multiprocessing
import time
import math

def cpu_task(n):
    """CPU-bound: compute sum of square roots."""
    return sum(math.sqrt(i) for i in range(n))

def io_task(n):
    """I/O-bound: simulate with time.sleep."""
    time.sleep(0.01)  # Simulate 10ms I/O wait
    return n

N = 4  # Number of tasks
WORK = 500_000

def run_threads(target, arg):
    """Start N threads running target(arg) and wait for all of them."""
    threads = [threading.Thread(target=target, args=(arg,)) for _ in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def main():
    # --- CPU-BOUND WORK ---
    print("=== CPU-BOUND (sum of square roots) ===")

    start = time.perf_counter()
    for _ in range(N):
        cpu_task(WORK)
    print(f"Sequential:    {time.perf_counter() - start:.3f}s")

    start = time.perf_counter()
    run_threads(cpu_task, WORK)
    print(f"4 Threads:     {time.perf_counter() - start:.3f}s  (GIL prevents speedup)")

    start = time.perf_counter()
    with multiprocessing.Pool(N) as pool:
        pool.map(cpu_task, [WORK] * N)
    print(f"4 Processes:   {time.perf_counter() - start:.3f}s  (true parallelism)")

    # --- I/O-BOUND WORK ---
    print("\n=== I/O-BOUND (simulated 10ms wait per task) ===")

    start = time.perf_counter()
    for _ in range(N):
        io_task(N)
    print(f"Sequential:    {time.perf_counter() - start:.3f}s")

    start = time.perf_counter()
    run_threads(io_task, N)
    print(f"4 Threads:     {time.perf_counter() - start:.3f}s  (GIL released during sleep)")

# The guard is required where multiprocessing spawns fresh interpreters
# (Windows, macOS): without it each worker would re-run the benchmark on import.
if __name__ == "__main__":
    main()

Typical results (4-core machine):

=== CPU-BOUND ===
Sequential:    1.821s
4 Threads:     1.934s   ← Slightly SLOWER (GIL contention)
4 Processes:   0.503s   ← ~3.6x faster (true parallelism)

=== I/O-BOUND ===
Sequential:    0.040s
4 Threads:     0.011s   ← ~3.6x faster (GIL released during sleep)

Decision rule: CPU-bound → concurrent.futures.ProcessPoolExecutor (or multiprocessing). I/O-bound → concurrent.futures.ThreadPoolExecutor or asyncio. NumPy/BLAS operations → threads work (GIL released in C).
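The decision rule can be written down as a small lookup. This is only a sketch: executor_for and its workload labels ("cpu", "io", "gil-releasing") are hypothetical names chosen for this example, and fake_io stands in for a real blocking call.

```python
import time
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor

def executor_for(workload: str) -> type[Executor]:
    """Map a workload label to an executor class (labels are this sketch's own)."""
    if workload == "cpu":
        return ProcessPoolExecutor     # separate processes, separate GILs
    if workload in ("io", "gil-releasing"):
        return ThreadPoolExecutor      # GIL is released while waiting or inside C code
    raise ValueError(f"unknown workload kind: {workload!r}")

def fake_io(x):
    time.sleep(0.01)                   # stand-in for a blocking network or disk call
    return x * 2

with executor_for("io")(max_workers=4) as pool:
    results = list(pool.map(fake_io, range(4)))

print(results)
```

Both executor classes share the same interface, so the calling code stays the same whichever one the rule picks. With the "cpu" choice, the submitted function must be importable by the worker processes and the script needs an `if __name__ == "__main__":` guard.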

Python 3.12: Per-Interpreter GIL

Python 3.12 introduced per-interpreter GIL via PEP 684. Previously, all sub-interpreters within a process shared a single GIL. Now each interpreter has its own GIL, enabling true parallelism across interpreters within one process:

# Python 3.12+ only
import sys

# Check if we're using 3.12+
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")

# In 3.12 the feature is officially reachable only through the C API
# (Py_NewInterpreterFromConfig). The private helper modules
# (_xxsubinterpreters in 3.12, _interpreters in 3.13) are undocumented
# and may change without notice.

# Per-interpreter GIL enables:
# - Multiple interpreters running concurrently in separate threads
# - Each interpreter has its own GIL, memory state, and module namespace
# - Objects cannot be freely shared between interpreters (must be serialised)

# Example using the higher-level API (PEP 734, Python 3.14+):
# from concurrent import interpreters
# interp = interpreters.create()
# interp.exec('import time; time.sleep(1)')  # Blocks the caller; start it from a thread to run concurrently

# The limitation: no shared memory between interpreters without explicit IPC
# Sub-interpreters are isolated - sharing data requires pickling or shared memory

Per-interpreter GIL is the infrastructure for the upcoming concurrent.interpreters module in Python 3.14, which will provide a higher-level API for parallelism without the multiprocessing process overhead.

Python 3.13: Free-Threaded Mode

Python 3.13 ships with an optional free-threaded build - a CPython variant that removes the GIL entirely. This is the most significant CPython change in 30 years.

To use it, install the free-threaded build:

# On macOS via pyenv
pyenv install 3.13t      # 't' suffix = free-threaded

# On Ubuntu (package from the deadsnakes PPA)
sudo apt install python3.13-nogil

# Check which build you have
python3 -VV               # Version banner mentions the free-threading build
python3 -c "import sys; print(sys._is_gil_enabled())"  # False = GIL is off

Control the GIL at runtime (free-threaded build only, where the GIL is off by default):

PYTHON_GIL=0 python3.13t my_script.py   # Keep the GIL off even if an extension asks for it
PYTHON_GIL=1 python3.13t my_script.py   # Turn the GIL back on
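The same checks can be made from inside a program. A minimal sketch, assuming a helper named free_threading_status (not a standard function): sysconfig.get_config_var("Py_GIL_DISABLED") reports how the interpreter was built, while sys._is_gil_enabled() reports whether the GIL is active right now.

```python
import sys
import sysconfig

def free_threading_status():
    """Return (built_without_gil, gil_enabled_now) for the running interpreter."""
    # 1 on free-threaded builds; 0 or None on regular builds
    built_without_gil = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists from 3.13 on; older versions always have the GIL
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled_now = check() if check is not None else True
    return built_without_gil, gil_enabled_now

built, enabled = free_threading_status()
print(f"free-threaded build: {built}, GIL currently enabled: {enabled}")
```

The two values can differ: a free-threaded build may still run with the GIL enabled, for example when it has been switched back on through the environment.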

What Changed: Biased Reference Counting

Without the GIL, ob_refcnt is a shared mutable integer accessed from multiple threads simultaneously. The naive fix - make every Py_INCREF/Py_DECREF atomic - is too expensive.

CPython 3.13 uses biased reference counting (from a research paper by Choi et al., 2018):

Each object has TWO reference count components:
  ob_ref_local:  reference count for the owning thread (non-atomic)
  ob_ref_shared: reference count from other threads (atomic)

When the owning thread increments a refcount: non-atomic increment to ob_ref_local
When another thread increments a refcount: atomic increment to ob_ref_shared

When ob_ref_local hits 0: merge ob_ref_shared, then deallocate if total is 0

This means the common case - an object used only by one thread - has no atomic operations. The expensive atomic path is only taken when objects are actually shared across threads.
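The split between a cheap owner path and an atomic shared path can be modelled in a few lines. This is a toy model only: the class name, the merged flag, and the use of threading.Lock as a stand-in for atomic instructions are assumptions of the sketch, and CPython's real layout and merge rules are more involved.

```python
import threading

class BiasedRefCount:
    """Toy model of biased reference counting. Not CPython's real layout."""

    def __init__(self):
        self.owner = threading.get_ident()   # thread that created the object
        self.local = 1                       # owner's count, updated without locking
        self.shared = 0                      # other threads' count
        self.merged = False                  # True once the owner's count reached zero
        self._atomic = threading.Lock()      # stands in for atomic instructions

    def _fast_path(self):
        return threading.get_ident() == self.owner and not self.merged

    def incref(self):
        if self._fast_path():
            self.local += 1                  # cheap, non-atomic path
        else:
            with self._atomic:
                self.shared += 1             # slow path for non-owner threads

    def decref(self):
        """Return True when the last reference is gone."""
        if self._fast_path():
            self.local -= 1
            if self.local > 0:
                return False
            with self._atomic:               # owner is done: consult the shared count
                self.merged = True
                return self.shared == 0
        with self._atomic:
            self.shared -= 1
            return self.merged and self.shared == 0

rc = BiasedRefCount()                        # the owner holds one reference

t = threading.Thread(target=rc.incref)       # another thread takes a reference
t.start()
t.join()

freed_by_owner = rc.decref()                 # owner drops its reference first

outcome = []
t = threading.Thread(target=lambda: outcome.append(rc.decref()))
t.start()
t.join()

print(freed_by_owner, outcome[0])
```

The owner's decrement does not free the object because another thread still holds a reference; the object is freed only when that thread's decrement brings the shared count to zero.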

What Changed: Per-Object Locks

The GIL also protected complex data structure operations that are not naturally atomic (e.g., dict resize, list append during iteration). Without the GIL, these need their own synchronisation.

CPython 3.13's free-threaded build adds per-object locking. Every object header carries a small (one-byte) mutex, ob_mutex, and built-in containers such as dict and list take it in short critical sections around operations that must appear atomic, rather than relying on one global lock.
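The idea of one lock per object, rather than one lock per process, can be illustrated at the Python level. GuardedList below is a hypothetical wrapper written for this example; it is not how CPython's list is built, but it shows that threads working on different objects never wait for each other.

```python
import threading

class GuardedList:
    """A list wrapper with its own lock (illustrative only)."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()     # one lock per object, not one per process

    def extend_atomically(self, values):
        with self._lock:                  # critical section covers the whole update
            for v in values:
                self._items.append(v)

    def snapshot(self):
        with self._lock:
            return list(self._items)

a, b = GuardedList(), GuardedList()

def fill(target, tag):
    for _ in range(100):
        target.extend_atomically([tag] * 10)

# Two threads share list a; a third works on list b and never contends with them
threads = [threading.Thread(target=fill, args=(lst, tag))
           for lst, tag in ((a, "x"), (a, "y"), (b, "z"))]
for t in threads:
    t.start()
for t in threads:
    t.join()

snap_a, snap_b = a.snapshot(), b.snapshot()
print(len(snap_a), len(snap_b))
```

Each batch of ten items lands in the list as an unbroken run, because the lock is held for the whole update.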

Current Performance Characteristics

# This is what the free-threaded build currently delivers (as of 3.13):
# - Single-threaded: ~20-40% slower than the GIL build
#   (mostly because the specializing adaptive interpreter is disabled in this
#   build, plus biased refcounting and per-object locking overhead)
# - Multi-threaded CPU-bound: approaching linear scaling (2 threads ≈ 1.8x faster)
# - Memory: higher (biased refcount fields + per-object lock fields)

# Check if GIL is enabled at runtime
import sys
if hasattr(sys, '_is_gil_enabled'):
    print(f"GIL enabled: {sys._is_gil_enabled()}")
else:
    print("GIL is always on (standard build)")

The 20-40% single-threaded overhead comes largely from the specializing adaptive interpreter (PEP 659) being disabled in the 3.13 free-threaded build, with the additional per-object and refcount bookkeeping on top. The CPython core team expects this to decrease over subsequent releases as optimisations are applied.

Working Around the GIL Today

While free-threaded Python matures, production systems use these strategies:

`concurrent.futures.ProcessPoolExecutor`

from concurrent.futures import ProcessPoolExecutor
import math
import time

def cpu_work(n):
    return sum(math.sqrt(i) for i in range(n))

def main():
    data = [500_000] * 8  # 8 tasks

    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(cpu_work, data))
    print(f"ProcessPoolExecutor: {time.perf_counter() - start:.3f}s")
    # True parallelism across 8 processes
    # Overhead: process creation + data pickling between processes

# Required on platforms that spawn worker processes (Windows, macOS)
if __name__ == "__main__":
    main()

`concurrent.futures.ThreadPoolExecutor` for I/O

from concurrent.futures import ThreadPoolExecutor
import urllib.request
import time

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as r:
        return len(r.read())

urls = ["https://httpbin.org/bytes/10000"] * 8

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(fetch, urls))
print(f"ThreadPoolExecutor: {time.perf_counter() - start:.3f}s")
# GIL released during I/O → true concurrency

Cython `nogil` for C Extensions

# myextension.pyx
def compute(double[:] arr):
    cdef Py_ssize_t i
    cdef Py_ssize_t n = arr.shape[0]   # Read the length while the GIL is still held
    cdef double total = 0.0
    # Release GIL for pure C computation
    with nogil:
        for i in range(n):
            total += arr[i]
    return total

NumPy (Already Releases GIL)

import numpy as np
import threading
import time

# NumPy's C loops release the GIL - threading works
def process(arr):
    return np.sum(arr ** 2)   # Element-wise loops run in C with the GIL released

arrays = [np.random.randn(1_000_000) for _ in range(4)]

start = time.perf_counter()
threads = [threading.Thread(target=process, args=(a,)) for a in arrays]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"NumPy threads: {time.perf_counter() - start:.3f}s")

Interview Q&A

Q1: What is the GIL and why does CPython have one?

The GIL (Global Interpreter Lock) is a mutex in CPython's runtime that must be held by any thread before it can execute Python bytecode. At most one thread holds the GIL at any time, meaning Python bytecode executes single-threaded even on multi-core hardware.

The GIL exists because CPython uses reference counting for memory management. Reference count operations (ob_refcnt++ and ob_refcnt--) are not atomic. Without the GIL, two threads simultaneously decrementing the same object's reference count could both read the current count, both subtract 1, both get a non-zero result, and never free the object - or worse, both reach zero and double-free it, causing a crash or memory corruption. The GIL prevents these races by ensuring only one thread runs at a time, making reference counting operations effectively atomic without the overhead of atomic CPU instructions on every increment and decrement.

The historical reason it was not removed earlier: removing the GIL while preserving single-threaded performance requires redesigning the memory management model from the ground up, which is what Python 3.13's free-threaded build has done.

Q2: Does the GIL prevent all race conditions in multithreaded Python programs?

No, and this is a critical misunderstanding. The GIL only protects CPython's internal state - the reference counts, the type objects, the interpreter's data structures. It does not protect application-level data.

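Consider: counter += 1 is not atomic even with the GIL. It compiles to LOAD_GLOBAL counter, LOAD_CONST 1, BINARY_OP +=, STORE_GLOBAL counter - four opcodes. The interpreter makes no promise that a thread keeps the GIL across all four: a switch can land between the load and the store. Recent CPython versions only check for switch requests at function calls and loop back-edges, which makes this particular race rarer, but that is an implementation detail and not a guarantee. If two threads interleave here, both may read the same initial value, both add 1, and both write back - the counter increments by 1 instead of 2.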

Any data shared between threads - lists, dicts, counters, flags - requires explicit synchronisation with threading.Lock, threading.RLock, or threading.Event. The queue.Queue class is thread-safe because it uses an internal lock. Use threading.local() for thread-local storage to avoid sharing data at all.
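A minimal sketch of the explicit synchronisation described above, using threading.Lock to make the read-modify-write a single unit. SafeCounter and work are illustrative names, and the thread and iteration counts are arbitrary.

```python
import threading

class SafeCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:        # load, add, and store happen as one unit
            self.value += 1

counter = SafeCounter()

def work(times):
    for _ in range(times):
        counter.increment()

threads = [threading.Thread(target=work, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)   # 4 threads x 10,000 increments
```

With the lock in place the total is always 40,000, regardless of how the threads are scheduled or which interpreter build runs the code.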

Q3: Why does adding threads make a CPU-bound Python program slower rather than faster?

Two effects combine to make CPU-bound threaded Python slower:

First, the GIL prevents true parallelism. Only one thread runs Python bytecode at a time. Two threads each doing N iterations of work takes at least as long as one thread doing 2N iterations - the GIL serialises them.

Second, GIL contention adds overhead. When two threads compete for the GIL, the OS schedules both of them on separate cores simultaneously. The waiting thread blocks on a condition variable with a 5ms timeout and, when the timeout expires, sets gil_drop_request. The running thread releases the GIL at its next eval breaker check, the OS wakes the waiting thread, and the cycle repeats roughly every 5ms. This constant hand-off - mutex locking, condition variable signalling, kernel context switches - is pure overhead compared to a single thread that never contends; in the benchmark at the top of this article it cost about 10%.

The correct tools for CPU-bound parallelism in Python are concurrent.futures.ProcessPoolExecutor or multiprocessing (separate processes with separate GILs), or libraries like NumPy that release the GIL during computation.

Q4: What changed in the "new GIL" introduced in Python 3.2?

The original GIL (pre-3.2) dropped the GIL every 100 bytecode instructions (sys.setcheckinterval() default). David Beazley's 2010 benchmark showed that on multi-core systems, this caused severe contention: the OS would schedule two threads simultaneously on two cores, both constantly trying to acquire the GIL. The result was thousands of failed acquisition attempts per second, excessive syscall overhead, and CPU-bound programs running far slower with two threads than one (close to twice as slow in his dual-core measurements).

The new GIL (Antoine Pitrou, 3.2) is time-based rather than instruction-count-based. A waiting thread blocks on a condition variable with a 5ms timeout. If the GIL has not changed hands when the timeout expires, the waiter sets the gil_drop_request flag. The holder drops the GIL at the next eval breaker check point (function entry, loop back-edge) and signals the waiter. The new design also prevents "convoy" effects where the dropping thread immediately re-acquires the GIL before the waiting thread can run: after releasing, the dropping thread waits on a second condition variable until another thread has actually taken the GIL. This eliminated the worst-case contention while maintaining reasonable fairness.

Q5: What is Python 3.13's free-threaded mode? Is it production-ready?

Python 3.13 ships a separate free-threaded build (python3.13t or the --disable-gil compile flag) that removes the GIL. In free-threaded mode, multiple threads can execute Python bytecode truly in parallel on multiple cores.

The key technical changes: (1) Biased reference counting - each object has a local refcount (non-atomic, fast) and a shared refcount (atomic, used when the object crosses thread boundaries); this avoids atomic operations in the common single-thread case. (2) Per-object locks - built-in mutable types (dict, list) have embedded locks for critical operations, replacing the GIL's blanket protection. (3) Immortal objects - True, False, None, and small integers never have their refcount modified at all, eliminating contention for these universally shared objects.

Production readiness as of Python 3.13 (March 2025): experimental, not recommended for production. Single-threaded performance is 20-40% slower than the GIL build, largely because the specializing adaptive interpreter is disabled in this build. Many C extensions are not yet thread-safe without the GIL (they assumed single-threaded bytecode execution). The Py_GIL_DISABLED preprocessor macro lets C extension authors detect the free-threaded build and add the needed synchronisation. The CPython documentation states an aim of 10% or less single-threaded overhead in later releases as the internals are optimised for the no-GIL model.
